InstaHide: Instance-hiding Schemes for Private Distributed Learning
How can multiple distributed entities collaboratively train a shared deep net
on their private data while preserving privacy? This paper introduces
InstaHide, a simple encryption scheme for training images that can be plugged into
existing distributed deep learning pipelines. The encryption is efficient, and
applying it during training has only a minor effect on test accuracy.
InstaHide encrypts each training image with a "one-time secret key" which
consists of mixing a number of randomly chosen images and applying a random
pixel-wise mask. Other contributions of this paper include: (a) Using a large
public dataset (e.g., ImageNet) for mixing during encryption, which improves
security. (b) Experimental results showing effectiveness in preserving privacy
against known attacks with only minor effects on accuracy. (c) Theoretical
analysis showing that successfully attacking privacy requires attackers to
solve a difficult computational problem. (d) Demonstrating that use of the
pixel-wise mask is important for security, since Mixup alone is shown to be
insecure against some efficient attacks. (e) Release of a challenge dataset at
https://github.com/Hazelsuko07/InstaHide_Challenge
Our code is available at https://github.com/Hazelsuko07/InstaHide
Comment: ICML 2020
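As described in the abstract, the encryption mixes each private image with several randomly chosen (e.g., public) images and then applies a random pixel-wise mask. The following is a minimal sketch of that recipe, assuming images are float arrays in a common range; the function and parameter names (instahide_encrypt, public_pool, k) are illustrative, not the authors' reference implementation.

```python
import numpy as np

def instahide_encrypt(private_image, public_pool, k=4, rng=None):
    """Sketch of InstaHide-style encryption of one training image (H, W, C array)."""
    rng = np.random.default_rng() if rng is None else rng
    # Pick k-1 images to mix in (drawn here from a public pool, per contribution (a)).
    idx = rng.choice(len(public_pool), size=k - 1, replace=False)
    images = [private_image] + [public_pool[i] for i in idx]
    # Random mixing coefficients that sum to 1 (part of the one-time secret key).
    coeffs = rng.dirichlet(np.ones(k))
    mixed = sum(c * img for c, img in zip(coeffs, images))
    # Random pixel-wise +/-1 sign mask (the other part of the one-time secret key).
    mask = rng.choice([-1.0, 1.0], size=mixed.shape)
    return mask * mixed
```

A fresh set of mix partners, coefficients, and mask is drawn for every image and every epoch, which is what makes the key "one-time"; label mixing and normalization details are omitted here.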
Matching-based Data Valuation for Generative Model
Data valuation is critical in machine learning, as it helps enhance model
transparency and protect data properties. Existing data valuation methods have
primarily focused on discriminative models, neglecting deep generative models
that have recently gained considerable attention. As with discriminative
models, there is an urgent need to assess data contributions in deep generative
models as well. However, previous data valuation approaches have mainly relied on
discriminative model performance metrics and required model retraining.
Consequently, they cannot be applied directly and efficiently to recent deep
generative models, such as generative adversarial networks and diffusion
models, in practice. To bridge this gap, we formulate the data valuation
problem in generative models from a similarity-matching perspective.
Specifically, we introduce Generative Model Valuator (GMValuator), the first
model-agnostic approach applicable to any generative model, designed to provide data
valuation for generation tasks. We have conducted extensive experiments to
demonstrate the effectiveness of the proposed method. To the best of our
knowledge, GMValuator is the first work that offers a training-free, post-hoc
data valuation strategy for deep generative models.
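The abstract frames valuation as similarity matching between generated samples and training data. The sketch below is one plausible reading of that idea, nearest-neighbor credit assignment in an embedding space; the function name and the specific matching rule are assumptions for illustration, not the GMValuator algorithm itself.

```python
import numpy as np

def similarity_matching_values(train_feats, gen_feats):
    """Illustrative similarity-matching valuation.

    train_feats: (n, d) embeddings of training data.
    gen_feats:   (m, d) embeddings of generated samples.
    Each generated sample credits its closest training datum; a datum's
    value is its accumulated (normalized) credit. No retraining is needed,
    which is the training-free, post-hoc property the abstract highlights.
    """
    values = np.zeros(len(train_feats))
    for g in gen_feats:
        dists = np.linalg.norm(train_feats - g, axis=1)
        values[np.argmin(dists)] += 1.0  # credit the closest training point
    return values / len(gen_feats)
```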
Privacy Implications of Retrieval-Based Language Models
Retrieval-based language models (LMs) have demonstrated improved
interpretability, factuality, and adaptability compared to their parametric
counterparts, by incorporating retrieved text from external datastores. While
it is well known that parametric models are prone to leaking private data, it
remains unclear how the addition of a retrieval datastore impacts model
privacy. In this work, we present the first study of privacy risks in
retrieval-based LMs, particularly kNN-LMs. Our goal is to explore the optimal
design and training procedure in domains where privacy is of concern, aiming to
strike a balance between utility and privacy. Crucially, we find that kNN-LMs
are more susceptible to leaking private information from their private
datastore than parametric models. We further explore mitigations of privacy
risks. When private information is targeted and readily detected in the text,
we find that a simple sanitization step would completely eliminate the risks,
while decoupling query and key encoders achieves an even better utility-privacy
trade-off. Otherwise, we consider strategies of mixing public and private data
in both datastore and encoder training. While these methods offer modest
improvements, they leave considerable room for future work. Together, our
findings provide insights for practitioners to better understand and mitigate
privacy risks in retrieval-based LMs. Our code is available at:
https://github.com/Princeton-SysML/kNNLM_privacy
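For context, a kNN-LM interpolates the parametric model's next-token distribution with a distribution built from nearest neighbors in a datastore of (context embedding, next token) pairs, so leakage can come from that datastore, and the sanitization mitigation amounts to filtering it before indexing. A minimal sketch, assuming standard kNN-LM interpolation; the names (knn_lm_probs, sanitize_datastore) and the toy filtering rule are illustrative, not the paper's exact procedure.

```python
import numpy as np

def knn_lm_probs(p_lm, query, datastore_keys, datastore_values,
                 vocab_size, k=8, lam=0.25, temperature=1.0):
    """Interpolate a parametric LM distribution with a kNN distribution.

    p_lm:             (vocab_size,) next-token probabilities from the LM.
    query:            (d,) embedding of the current context.
    datastore_keys:   (N, d) stored context embeddings.
    datastore_values: (N,) stored next-token ids (numpy int array).
    """
    dists = np.linalg.norm(datastore_keys - query, axis=1)
    nn = np.argsort(dists)[:k]                      # k nearest datastore entries
    weights = np.exp(-dists[nn] / temperature)
    weights /= weights.sum()
    p_knn = np.zeros(vocab_size)
    for w, token in zip(weights, datastore_values[nn]):
        p_knn[token] += w                           # mass on retrieved tokens
    return lam * p_knn + (1.0 - lam) * p_lm

def sanitize_datastore(texts, is_private):
    """Toy sanitization: drop entries flagged as containing targeted private
    information before building datastore keys/values."""
    return [t for t in texts if not is_private(t)]
```

Because retrieved tokens come verbatim from the datastore, any private string stored there can surface in generation, which is why filtering the datastore (or decoupling the query and key encoders) directly shapes the utility-privacy trade-off discussed above.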
Sparsity-Preserving Differentially Private Training of Large Embedding Models
As the use of large embedding models in recommendation systems and language
applications increases, concerns over user data privacy have also risen.
DP-SGD, a training algorithm that combines differential privacy with stochastic
gradient descent, has been the workhorse in protecting user privacy without
compromising model accuracy by much. However, applying DP-SGD naively to
embedding models can destroy gradient sparsity, leading to reduced training
efficiency. To address this issue, we present two new algorithms, DP-FEST and
DP-AdaFEST, that preserve gradient sparsity during private training of large
embedding models. Our algorithms achieve substantial reductions
in gradient size, while maintaining comparable levels of accuracy, on benchmark
real-world datasets.
Comment: Neural Information Processing Systems (NeurIPS) 2023
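To illustrate the sparsity issue the abstract describes: naive DP-SGD adds Gaussian noise to every row of the embedding table, including rows the batch never touched, so the privatized gradient becomes dense. The sketch below contrasts that with a simplified variant that perturbs only selected rows; it is in the spirit of, but not identical to, DP-FEST/DP-AdaFEST, and all function and parameter names are hypothetical.

```python
import numpy as np

def naive_dp_embedding_grad(grad_rows, touched, vocab_size, dim, clip, sigma, rng):
    """Naive DP-SGD: scatter the clipped per-batch gradient into the full
    (vocab_size, dim) table, then add noise to *every* row -> dense output."""
    g = np.zeros((vocab_size, dim))
    g[touched] = grad_rows                      # rows actually used by the batch
    return g + rng.normal(0.0, sigma * clip, size=g.shape)

def sparsity_preserving_dp_grad(grad_rows, touched, dim, clip, sigma, rng):
    """Simplified sparsity-preserving variant: add noise only to a selected set
    of rows and return a sparse (indices, values) gradient. Note: selecting
    exactly the touched rows is not private by itself; the paper's algorithms
    choose the retained rows in a differentially private way."""
    noisy = grad_rows + rng.normal(0.0, sigma * clip, size=(len(touched), dim))
    return touched, noisy
```

Keeping the gradient in a sparse (indices, values) form is what recovers the training-efficiency benefits that dense noise destroys for very large embedding tables.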